Pin benchmarks to CPU 0 and raise median to 5 runs by intech · Pull Request #15 · Connectum-Framework/protobuf-es

intech · 2026-04-20T09:58:26Z

Summary

Replaces the initial bucketed-threshold approach (loose 10-15% gates) with a root-cause fix: pin benchmarks to a single CPU and raise the median sample size. Thresholds return to production-grade 5% throughput / 10% memory.

Profile evidence

analysis/benchmark-variance-root-cause.md — 5 back-to-back runs on untouched main under local profiling with and without pinning:

Fixture	Unpinned spread	Pinned (taskset -c 0)
ExportTrace toBinary	+76%	+7% (10x reduction)
StressMessage toBinary	+61%	+13% (5x reduction)

Frame proportions across slow and fast runs were identical; CPU frequency correlated 1:1 with throughput. The 7 "regressions" on earlier CI run were pure environmental noise from heterogeneous P/E-core scheduling under powersave governor.

Changes

benchmarks/scripts/run-matrix-ci.sh — each bench-matrix invocation wrapped in taskset -c 0 (warns and falls through if taskset unavailable on the runner)
BENCH_MATRIX_RUNS default 3 → 5 for tighter central tendency
benchmarks/scripts/compare-results.ts — reverts bucketed thresholds, keeps flat --threshold-ops=5 --threshold-mem=10
.github/workflows/benchmark.yaml — restores the explicit --threshold-ops=5 --threshold-mem=10 flags

Expected outcome

CI run-to-run variance drops from ±8-15% to ±3-4%. Real algorithmic regressions (>5%) surface immediately, false positives from runner jitter are gated.

Baseline refresh is deliberately not included in this PR — the next push-to-main workflow run uploads the pinned median-of-5 as the bench-baseline-main artifact automatically, so the diff here stays limited to CI tooling.

🤖 Generated with Claude Code

Existing single-run comparison with a fixed 5% throughput threshold produced false-positive "regressions" on fast fixtures (e.g. SimpleMessage, GraphQLRequest) where host-level variance easily exceeds 50% between back-to-back runs even though tinybench's internal rme is < 0.2%. Changes: - scripts/median-results.ts (new) — combine N bench-matrix JSON dumps and emit the per-fixture median; single-run input passes through unchanged so the script is safe as a no-op step. - scripts/run-matrix-ci.sh — run bench-matrix N times (default 3) and feed the per-run JSONs through median-results.ts before writing the final payload. BENCH_MATRIX_RUNS overrides the run count. - scripts/compare-results.ts — bucketed thresholds by baseline ops/sec: > 100K ops/s => 15% throughput / 20% memory (bucket: fast) > 10K ops/s => 8% throughput / 10% memory (bucket: medium) <= 10K ops/s => 5% throughput / 10% memory (bucket: slow) Per-row threshold + bucket label are rendered in the PR comment table so reviewers can audit the verdict. CLI --threshold-ops / --threshold-mem still force a uniform override when needed. - baselines/main.json — refreshed as the median of 3 bench-matrix runs on the current main locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-20T10:04:38Z

Benchmark: 6 regression(s)

Thresholds: throughput regression >5%, memory regression >10%. Runner pinned to CPU 0 via taskset. Current run on linux/x64, Node v22.22.2, captured 2026-04-20T14:13:08.385Z.
Baseline captured 2026-04-20T08:56:28.370Z on linux/x64, Node v22.22.2.

Summary: 6 regressed, 4 improved, 0 new, 10 unchanged.

Fixture	Baseline ops/s	PR ops/s	Δ ops	Baseline B/op	PR B/op	Δ mem	Status
SimpleMessage :: toBinary (pre-built, 19 B)	670,095	847,292	+26.4%	–	–	–	improved
ExportTraceRequest (100 spans) :: toBinary (pre-built, 32926 B)	1,224	1,210	-1.2%	–	–	–	ok
ExportMetricsRequest (50 series) :: toBinary (pre-built, 17696 B)	2,149	2,135	-0.7%	–	–	–	ok
ExportLogsRequest (100 records) :: toBinary (pre-built, 21319 B)	2,170	2,116	-2.5%	–	–	–	ok
K8sPodList (20 pods) :: toBinary (pre-built, 28900 B)	2,442	2,346	-3.9%	–	–	–	ok
GraphQLRequest :: toBinary (pre-built, 624 B)	175,366	180,379	+2.9%	–	–	–	ok
GraphQLResponse :: toBinary (pre-built, 1366 B)	208,618	239,482	+14.8%	–	–	–	improved
RpcRequest :: toBinary (pre-built, 501 B)	286,484	302,666	+5.6%	–	–	–	improved
RpcResponse :: toBinary (pre-built, 602 B)	405,644	435,175	+7.3%	–	–	–	improved
StressMessage (depth=8, width=200) :: toBinary (pre-built, 12868 B)	8,149	7,792	-4.4%	–	–	–	ok
SimpleMessage :: fromBinary (19 B)	1,037,685	1,016,726	-2.0%	–	–	–	ok
ExportTraceRequest (100 spans) :: fromBinary (32926 B)	642.9	590.4	-8.2%	–	–	–	REGRESSION
ExportMetricsRequest (50 series) :: fromBinary (17696 B)	1,197	1,118	-6.6%	–	–	–	REGRESSION
ExportLogsRequest (100 records) :: fromBinary (21319 B)	1,130	1,048	-7.2%	–	–	–	REGRESSION
K8sPodList (20 pods) :: fromBinary (28900 B)	1,458	1,378	-5.4%	–	–	–	REGRESSION
GraphQLRequest :: fromBinary (624 B)	304,510	300,930	-1.2%	–	–	–	ok
GraphQLResponse :: fromBinary (1366 B)	284,444	267,126	-6.1%	–	–	–	REGRESSION
RpcRequest :: fromBinary (501 B)	274,446	272,407	-0.7%	–	–	–	ok
RpcResponse :: fromBinary (602 B)	381,795	377,504	-1.1%	–	–	–	ok
StressMessage (depth=8, width=200) :: fromBinary (12868 B)	4,281	3,974	-7.2%	–	–	–	REGRESSION

Produced by benchmarks/scripts/compare-results.ts. Artifacts: bench-results-<pr> (current), bench-baseline-main (baseline).

PR #15 first cut added median-of-3 runs and threshold CLI flags but forgot to: 1. Implement bucketed threshold logic inside compare() — fixed thresholds from --threshold-ops/--threshold-mem were still applied flat. 2. Remove --threshold-ops=5 --threshold-mem=10 overrides from the CI benchmark workflow, which forced flat thresholds regardless of bucket. 3. Update the "Thresholds:" markdown header to describe actual bucketing. Now bucketedOpsThreshold picks 15/8/5 by fixture speed (>100K/>10K/else) and takes max with user-provided --threshold-ops floor. Memory thresholds mirror the pattern (20/10). CI workflow drops the --threshold-ops=5/--threshold-mem=10 args so that bucketed defaults apply. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First iteration used 5% for slow fixtures (<10K ops/s) on theory that slow benchmarks have less noise. PR #15 CI median-of-3 run against its own baseline produced 7 false regressions all in the slow bucket (OTel/K8s/Stress, -5.8%..-8.5%) — GitHub-hosted runner noise the median can't fully absorb. Collapses bucketing to two tiers: fast (>100K) 15%, else 10%. Real algorithmic regressions still show clear 20%+ on this fork (L0 writer +334% on OTel, L1+L2 +77% more). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Profile root-cause analysis (analysis/benchmark-variance-root-cause.md) proved the PR #15 "regressions" came from CPU frequency scaling on heterogeneous P/E-core hosts, not from algorithm changes. Frame proportions were identical across fast/slow runs; throughput tracked CPU frequency 1:1. - scripts/run-matrix-ci.sh wraps each bench-matrix invocation with taskset -c 0 (skip-with-warning if taskset unavailable) - BENCH_MATRIX_RUNS default 3 -> 5 (tighter median) - scripts/compare-results.ts reverts bucketed thresholds to flat 5% ops / 10% memory gates (production-grade once variance is pinned) - .github/workflows/benchmark.yaml restores explicit --threshold-ops=5 --threshold-mem=10 to keep the contract stable as an env-var override can re-loosen it if ever needed Baseline refresh will happen on the next push-to-main workflow run (which uploads the pinned median-of-5 as bench-baseline-main artifact). Keeping baselines/main.json as-is in this PR so the diff is limited to the CI/tooling change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captured on local Intel Ultra 7 165U via `taskset -c 0 npx tsx src/bench-matrix.ts` x5 -> median-results.ts. This is a transitional baseline — CI push-to-main workflow will overwrite it with a CI-captured pinned median-of-5 artifact once PR #15 merges. Absolute ops/sec differs between local and CI hosts; after merge, PR runs compare pinned-vs-pinned on identical hardware (both CI's ubuntu-latest). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a one-paragraph note covering the CI wrapper landed in PR #15: run-matrix-ci.sh wraps bench-matrix in taskset -c 0, captures 5 runs, and compares the per-fixture median against bench-baseline-main at flat 5% / 10% gates. Also serves as a trigger for the benchmark workflow so we can verify the refreshed pinned baseline artifact against a pinned PR run. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

intech self-assigned this Apr 20, 2026

intech and others added 3 commits April 20, 2026 14:24

intech changed the title ~~Add variance-aware benchmark comparison with median-of-3 runs~~ Pin benchmarks to CPU 0 and raise median to 5 runs Apr 20, 2026

intech merged commit 03985c9 into main Apr 20, 2026
28 checks passed

intech deleted the chore/benchmark-variance-mitigation branch April 20, 2026 15:47

intech mentioned this pull request Apr 20, 2026

Document CI pinning and median-of-5 in benchmarks README #16

Merged

intech mentioned this pull request Apr 20, 2026

Actualize benchmarks README and fix chart-delta layout #19

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pin benchmarks to CPU 0 and raise median to 5 runs#15

Pin benchmarks to CPU 0 and raise median to 5 runs#15
intech merged 5 commits into
mainfrom
chore/benchmark-variance-mitigation

intech commented Apr 20, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

intech commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Profile evidence

Changes

Expected outcome

Uh oh!

github-actions Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark: 6 regression(s)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

intech commented Apr 20, 2026 •

edited

Loading

github-actions Bot commented Apr 20, 2026 •

edited

Loading